AITopics | inequality follow

Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $ε$-stationary point in $\tilde{\mathcal{O}}(ε^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(ε^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.
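For orientation, the standard Nikaido-Isoda function of a two-player zero-sum game can be written as below. This is a generic textbook form, not necessarily the regularized definition used in the paper; the payoff $f$, the minimizing variable $x$, and the maximizing variable $y$ are illustrative symbols that do not appear in the abstract.

```latex
% Generic zero-sum game: x minimizes f(x, y), y maximizes it (assumed notation).
\Psi(x, y) \;=\; \max_{y'} f(x, y') \;-\; \min_{x'} f(x', y) \;\ge\; 0,
\qquad
\Psi(x^\star, y^\star) = 0
\;\Longleftrightarrow\;
(x^\star, y^\star) \text{ is a saddle point of } f .
```

Because $\Psi$ is nonnegative and vanishes exactly at saddle points (when the inner max and min are attained), it is a natural quantity to penalize in order to enforce the lower-level equilibrium constraint.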

artificial intelligence, machine learning, optimization problem, (14 more...)

arXiv.org Machine Learning

2605.26654

Country: Asia (0.27)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)


Achieving $ε^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

Hamza, Ishaq, Chen, Zaiwei

arXiv.org Machine Learning, May-14-2026

In this paper, we establish last-iterate convergence rates for off-policy actor--critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity guarantee for finding an $ε$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2605.13639

Country: North America > United States (0.28)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)


Locally Near Optimal Piecewise Linear Regression in High Dimensions via Difference of Max-Affine Functions

Kanj, Haitham, Lee, Kiryung

arXiv.org Machine Learning, May-11-2026

This paper presents a parametric solution to piecewise linear regression through the Adaptive Block Gradient Descent (ABGD) algorithm. The heart of the method is the parametrization of piecewise linear functions as the difference of max-affine (DoMA) functions. A non-asymptotic local convergence analysis for ABGD is provided under sub-Gaussian covariate and noise distributions. To initialize ABGD, we adapt a prior algorithm originally developed for the simpler setting of max-affine functions. When suitably initialized, ABGD converges linearly to an $ε$-accurate estimate given $\tilde{\mathcal{O}}(d\max(σ_z/ε,1)^2)$ observations where $σ_z^2$ denotes the noise variance. This implies exact recovery given $\tilde{\mathcal{O}}(d)$ samples in the noiseless case. Also, such a rate is shown to be minimax optimal up to logarithmic factors. Synthetic numerical results corroborate the theoretical guarantees for ABGD. We also observe competitive performance compared to the state-of-the-art methods on real-world datasets.
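The difference-of-max-affine (DoMA) parametrization mentioned in the abstract can be illustrated with a minimal sketch. It only evaluates a DoMA function and does not implement ABGD; the helper names `max_affine` and `doma` and the toy parameters are hypothetical.

```python
import numpy as np

def max_affine(X, A, b):
    """Evaluate max_j (a_j . x + b_j) for each row x of X."""
    return np.max(X @ A.T + b, axis=1)

def doma(X, A_pos, b_pos, A_neg, b_neg):
    """Difference of two max-affine functions, evaluated row-wise."""
    return max_affine(X, A_pos, b_pos) - max_affine(X, A_neg, b_neg)

# Toy 1-D example (assumed values): |x| = max(x, -x) - max(0).
X = np.array([[-2.0], [0.5], [3.0]])
A_pos = np.array([[1.0], [-1.0]])   # slopes of the first max-affine part
b_pos = np.zeros(2)                 # offsets of the first max-affine part
A_neg = np.zeros((1, 1))            # second part is the constant 0
b_neg = np.zeros(1)

y = doma(X, A_pos, b_pos, A_neg, b_neg)
print(y)
```

Each max-affine part is convex and piecewise linear, so their difference can represent non-convex piecewise linear functions as well.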

artificial intelligence, machine learning, regression, (18 more...)

arXiv.org Machine Learning

2605.06959

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)


31b3b31a1c2f8a370206f111127c0dbd-Supplemental.pdf

Neural Information Processing Systems, May-1-2026, 02:05:00 GMT

artificial intelligence, cal, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)


15a50c8ba6a0002a2fa7e5d8c0a40bd9-Supplemental.pdf

Neural Information Processing Systems, May-1-2026, 01:50:19 GMT

artificial intelligence, data mining, machine learning, (15 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.47)


f334c3375bd3744e98a0ca8eaa2403b0-Supplemental-Conference.pdf

Neural Information Processing Systems, Apr-30-2026, 07:23:03 GMT

artificial intelligence, data mining, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States (0.45)
Europe (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.93)
Information Technology > Data Science > Data Mining (0.68)


An Exploration-by-Optimization Approach to Best of Both Worlds in Linear Bandits

Neural Information Processing Systems, Apr-30-2026, 02:09:01 GMT

In this paper, we consider how to construct best-of-both-worlds linear bandit algorithms that achieve nearly optimal performance for both stochastic and adversarial environments. For this purpose, we show that a natural approach referred to as exploration by optimization [Lattimore and Szepesvári, 2020b] works well.

artificial intelligence, data mining, machine learning, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.64)


A Unified Model and Dimension for Interactive Estimation

Neural Information Processing Systems, Apr-29-2026, 19:05:04 GMT

We study an abstract framework for interactive learning called interactive estimation in which the goal is to estimate a target from its "similarity" to points queried by the learner. We introduce a combinatorial measure called dissimilarity dimension which is used to derive learnability bounds in our model. We present a simple, general, and broadly-applicable algorithm, for which we obtain both regret and PAC generalization bounds that are polynomial in the new dimension. We show that our framework subsumes and thereby unifies two classic learning models: statistical-query learning and structured bandits. We also delineate how the dissimilarity dimension is related to well-known parameters for both frameworks, in some cases yielding significantly improved analyses.

artificial intelligence, dimension, machine learning, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Noise-Adaptive Thompson Sampling for Linear Contextual Bandits

Neural Information Processing Systems, Apr-27-2026, 01:58:01 GMT

Linear contextual bandits are a fundamental class of models with numerous real-world applications, and it is critical to develop algorithms that can effectively manage noise with unknown variance while ensuring provable guarantees for both worst-case constant-variance noise and deterministic reward scenarios.

artificial intelligence, data mining, machine learning, (20 more...)

Neural Information Processing Systems

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)


A The Algorithm

Neural Information Processing Systems, Apr-26-2026, 02:38:20 GMT

Construct the optimistic MDP $\widetilde{M}_k$ and compute the optimistic policy $π_k$ (Algorithm 5). When the counter is 0 it gets $(s,a)$, i.e., $Ω_{i,e}$ = (s,a,). When the counter is 1, we take $(s,a)$ from ωn and map them to ωn/2 while eliminating half of the factors in consideration with the consistent scope $Z_i$ chosen by the policy (stored in factor $2d+1+i$ of the state). It is handled similarly to the previous item, but considers the reward consistent scope $z_j$ chosen by the policy (stored in factor $3d+1+j$ of the state). For $i = 1,\dots,d$, the $i$-th factor is taken from factor $i$ of the previous state when the counter is not $\log n + 1$, and otherwise performs the optimistic transition of factor $i$. Denote the value in the last factor of $Ω_{i,e}$ by $v_e$, the policy's chosen scope by $Z_i$ (stored in factor $2d+1+i$ of the state) and the policy's chosen next state direction by $s'_i$ (stored in factor $d+1+i$ of the state).

artificial intelligence, failure event, scope size, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.94)


Bilevel Optimization over Saddle Points of Zero-Sum Markov Games